The term Support Vector Machines (SVMs) is sometimes used loosely to refer to three methods: the Maximal Margin Classifier, the Support Vector Classifier, and the Support Vector Machine itself.
Each is an extension of the previous method, allowing it to be applied to a broader range of cases.
Support Vector Machines (SVMs) are a common discriminative algorithm, well suited to complex small- to medium-sized datasets (Géron). They aim to find a hyperplane that provides the maximum margin of separation between classes of objects.
They can be used for both classification and regression.
To demonstrate SVMs I'll be using Fisher's (or Anderson's) Iris flowers dataset (Fisher).
Image("1_2b9TA6i27eGjeDQE9qUaEg.png") # Image from:
The data set consists of 50 samples from each of three species of Iris, with four features measured from each sample: the length and the width of the sepals and petals, in centimeters.
display(Image("03_iris.png")) # Image from: https://www.ritchieng.com/machine-learning-iris-dataset/
iris_df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sns.pairplot(iris_vis, hue="target", height=3, aspect = 1.5)
plt.show()
set_versi_plot()
plt.plot(x0, decision_boundary, "g-", linewidth=2)
plt.plot(x0, pred_2, "m-", linewidth=2)
plt.plot(x0, pred_3, "r-", linewidth=2)
plt.show()
Here the middle thick line is the **hyperplane** and the dashed outer lines are the edges of the **margin**.
plot_svc_decision_boundary(svm_clf, 0, 5.5)
set_versi_plot()
plt.show()
In $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$ (affine meaning it need not pass through the origin).
Examples
three_dim()
Two dimensions: $\beta_0 + \beta_1X_1 + \beta_2X_2 = 0$
Three dimensions: $\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 = 0$
$p$ dimensions: $\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p = 0$
If $X = (X_1, ..., X_p)^T$ satisfies the equation above, then it is a point on the hyperplane.
If $\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p > 0$, the point lies on one side of the hyperplane, and if $\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p < 0$, on the other. In other words, the hyperplane divides $p$-dimensional space into two halves.
We aim to classify an $n \times p$ data matrix of $n$ observations in $p$-dimensional space, where the observations fall into two classes, $y_1,...,y_n \in \{-1,1\}$.
If we were to perfectly separate the classes the hyperplane would have the property that
$\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... + \beta_pX_{ip} > 0$ if $y_i = 1$,
and
$\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... + \beta_pX_{ip} < 0$ if $y_i = -1$.
For a new test observation $x^*$, we assign it to class 1 if $f(x^*)$ is positive and class $-1$ if negative. Furthermore, the magnitude of $f(x^*)$ indicates how far the point lies from the hyperplane.
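As a minimal sketch, the sign rule above can be written directly (the coefficients here are made up for illustration, not fitted values):

```python
import numpy as np

# Hypothetical hyperplane in p = 2 dimensions:
# f(x) = beta_0 + beta_1*x_1 + beta_2*x_2 (made-up coefficients)
beta_0, beta = -3.0, np.array([1.0, 2.0])

def f(x):
    """Signed score of x relative to the hyperplane."""
    return beta_0 + beta @ x

def classify(x):
    """Class 1 if f(x) > 0, class -1 otherwise; |f(x)| reflects
    how far the point lies from the hyperplane."""
    return 1 if f(x) > 0 else -1

print(classify(np.array([4.0, 1.0])))  # f = -3 + 4 + 2 = 3 > 0, so class 1
print(classify(np.array([0.5, 0.5])))  # f = -1.5 < 0, so class -1
```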
We need a reasonable way of constructing a hyperplane out of the possible choices.
The maximal margin hyperplane is the separating hyperplane that is furthest from the training observations: we compute the perpendicular distance from each training observation to a given separating hyperplane, and the maximal margin hyperplane is the one for which this margin is largest.
We hope the classifier with a large margin on the training data will generalise well to unseen test observations.
plot_svc_margin(svm_clf, 0, 5.5)
set_versi_plot()
plt.show()
We can see below that there are two points equidistant from the maximal margin hyperplane, lying on the dashed lines. These observations are called support vectors: if they moved, so would the hyperplane.
The maximal margin hyperplane depends only on these support vectors, meaning the other points could be moved without the hyperplane moving.
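A hedged sketch of this in scikit-learn: fit a near-hard-margin linear SVC on a few hypothetical points (the data are invented) and inspect which of them the fitted model reports via its `support_vectors_` attribute:

```python
import numpy as np
from sklearn.svm import SVC

# Small, linearly separable set of hypothetical points
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin
svm_clf = SVC(kernel="linear", C=1e9)
svm_clf.fit(X, y)

# Only these points pin down the maximal margin hyperplane; any other
# point can be moved (without crossing the margin) and nothing changes.
print(svm_clf.support_vectors_)
```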
[Maybe add the about all other weights being 0 from other lecture]
plot_svc_margin(svm_clf, 0, 5.5, highlight=True)
set_versi_plot()
plt.show()
This maximal margin hyperplane is the solution to the optimisation problem of choosing $\beta_0,\beta_1, ...,\beta_p$ to maximise $M$,
subject to $\sum^p_{j=1}\beta^2_j = 1$,
$y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip}) \geq M \quad \forall i = 1,...,n$.
The constraints ensure that each observation is on the correct side of the hyperplane and at least a distance $M$ from it; $M$ therefore represents the width of the margin, the quantity we maximise.
However, this method is sensitive to outliers. In figure X, we can see that the reliance on a small number of observations means there is now a small margin. We want to be confident that distance from the hyperplane is a measure of our confidence in a classification, and that we have not overfit to our training data.
In other cases, no exact linear separating hyperplane exists (so no solution to $M > 0$). Therefore we may want to use a hyperplane that almost separates the two classes, allowing some errors, using a soft margin (Support Vector Classifier).
Furthermore, if $p$ is large, this approach often leads to overfitting.
hard_margin_limits()
Describe what a support vector is.
In the plot below, which points are the "support vectors"?
[INSERT]
# QUESTION 2
exersise_ = iris_vis[["sepal length (cm)", "sepal width (cm)", "target"]]
exersise_ = exersise_[exersise_.target != "Virginica"]
X = exersise_[["sepal length (cm)", "sepal width (cm)"]].values
y = le.transform(exersise_[["target"]].values.ravel())
# SVM Classifier model
svm_clf = SVC(kernel="linear", C=float("inf"))
svm_clf.fit(X, y)
w = svm_clf.coef_[0]
b = svm_clf.intercept_[0]
x0 = np.linspace(4.2, 7, 200)
decision_boundary = -w[0]/w[1] * x0 - b/w[1]
plt.plot(x0, decision_boundary, "g-", linewidth=2)
g = sns.scatterplot(data=exersise_, x = "sepal length (cm)",
y = "sepal width (cm)", hue="target",
style = "target")
# ANSWERS 2
exersise_ = iris_vis[["sepal length (cm)", "sepal width (cm)", "target"]]
exersise_ = exersise_[exersise_.target != "Virginica"]
X = exersise_[["sepal length (cm)", "sepal width (cm)"]].values
y = le.transform(exersise_[["target"]].values.ravel())
# SVM Classifier model
svm_clf = SVC(kernel="linear", C=float("inf"))
svm_clf.fit(X, y)
w = svm_clf.coef_[0]
b = svm_clf.intercept_[0]
x0 = np.linspace(0, 5.5, 200)
decision_boundary = -w[0]/w[1] * x0 - b/w[1]
plot_svc_decision_boundary(svm_clf, 4.2, 7)
g = sns.scatterplot(data=exersise_, x = "sepal length (cm)",
y = "sepal width (cm)", hue="target",
style = "target")
SVCs are a generalisation and extension of the maximal margin classifier, so they can be applied to a broader range of cases (James).
In practice they are more robust to individual observations and better classify most training observations than the maximal margin classifier. They take the view that it is better to misclassify a few training examples in order to do a better job classifying the rest.
This is called a soft margin because it allows a small subset of training observations to violate the margin, and some even to lie on the wrong side of the hyperplane.
soft_margin("Soft Margin", hyperplane=True)
plt.show()
This is the solution to the optimisation problem of choosing $\beta_0,\beta_1, ...,\beta_p$, together with slack variables $\epsilon_1,..., \epsilon_n$, to maximise $M$,
subject to $\sum^p_{j=1}\beta^2_j = 1$,
$y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip}) \geq M(1-\epsilon_i)$,
$\epsilon_i \geq 0$, $\sum^n_{i=1}\epsilon_i \leq C$,
where $C$ is a nonnegative tuning parameter.
$\epsilon_i$ tells us where the $i$th observation is located relative to the hyperplane; $\epsilon_i = 0$ being on the correct side of the margin, $\epsilon_i > 0$ being on the wrong side of the margin, and $\epsilon_i > 1$ on the wrong side of the hyperplane.
$\epsilon_1 = ... = \epsilon_n = 0$ is the maximal margin hyperplane optimisation.
$C$ bounds the sum of the $\epsilon_i$'s, so a larger $C$ permits a wider margin with a larger number of, and more severe, margin violations. $C$ can therefore be viewed as a budget for violations of the margin.
However, in scikit-learn "the strength of the regularization is inversely proportional to C"! There, $C$ multiplies the penalty on margin violations in the objective, so a large $C$ punishes violations heavily (a small budget in the formulation above), while a small $C$ tolerates many violations (a large budget).
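A small sketch of scikit-learn's convention, using hypothetical overlapping blob data: a small `C` means strong regularisation, so the margin is wide and frequently violated, and many points become support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs (hypothetical data) so some margin violations
# are unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=42)

for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # In scikit-learn, small C = strong regularisation = a wide, frequently
    # violated margin (many support vectors); large C = hard fitting,
    # few support vectors.
    print(f"C={C}: {len(clf.support_vectors_)} support vectors")
```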
soft_margin([1,100])
$C$ is a tuning parameter that controls the bias-variance trade-off: more flexible models achieve lower bias at the cost of higher variance, and vice versa. When $C$ (the budget) is small we have narrow margins that are rarely violated, highly fit to the training data (low bias, high variance). Conversely, when $C$ is larger, the margin is wider, amounting to less hard fitting (high bias, low variance).
Like most hyper-parameters, it is often chosen using cross-validation.
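For instance, a hedged sketch of choosing `C` by cross-validated grid search (the grid values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("svm_clf", SVC(kernel="linear"))])

# 5-fold cross-validation over an (arbitrary) grid of C values;
# the best C is the one with the highest mean validation accuracy.
grid = GridSearchCV(pipe, {"svm_clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```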
Like maximal margin classifiers, SVCs rely on only a few observations: those on the margin or those that violate it (the support vectors). Observations on the correct side of the margin don't change the classifier, which means SVCs are robust to observations far away from the hyperplane.
Support Vector Machines aim to address the situation where the boundary between two classes is not linear.
import warnings
# TEMP
with warnings.catch_warnings():
warnings.simplefilter("ignore")
non_linear_examples()
We could consider enlarging the feature space using quadratic, cubic or higher-order polynomial functions; however, the larger the number of features, the higher the computational burden. Instead it is common to enlarge the feature space using an extension of an SVC, termed a Support Vector Machine, which uses kernels.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
print("Polynomial Feature Engineering")
%timeit polynomial_feat.fit(X, y)
print("Polynomial Kernel")
%timeit polynomial_svm.fit(X, y)
Polynomial Feature Engineering
1.52 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Polynomial Kernel
1.04 ms ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The kernel approach is a computationally efficient way to enlarge our feature space to accommodate a non-linear boundary.
Skipping over some technical details (see REF), it turns out we can use the inner products of two observations rather than using the observations themselves:
$\langle x_i,x_{i^{\prime}}\rangle = \sum^p_{j=1}x_{ij}x_{i^{\prime}j}$.
Using this we can represent the linear support vector classifier as:
$f(x) = \beta_0 + \sum^n_{i=1}\alpha_i\langle x,x_i\rangle$.
In the above case, estimating the parameters $\alpha_1,...,\alpha_n$ and $\beta_0$ requires the $n(n-1)/2$ inner products $\langle x_i,x_{i^{\prime}}\rangle$ between all pairs of training observations. Similarly, to compute $f(x)$ we need the inner product between $x$ and each training point $x_i$.
However, $\alpha_i$ is nonzero only for support vectors, so given the collection $S$ of indices of these support points we can instead compute:
$f(x) = \beta_0 + \sum_{i\in S}\alpha_i\langle x,x_i\rangle$.
Instead of calculating the inner product directly, we can use a generalisation, $K(x_i,x_{i^{\prime}})$, where $K$ is a kernel. We can then define the classifier as:
$f(x) = \beta_0 + \sum_{i\in S}\alpha_iK(x,x_i)$.
A kernel is a function that quantifies the similarity of two observations. For example, a linear kernel:
$K(x_i, x_{i'}) = \sum^p_{j=1}x_{ij}x_{i'j}$,
quantifies the similarity of pairs of observations using Pearson (standard) correlation.
However, we could use other forms of kernel to fit the support vector classifier in a higher-dimensional space, such as a polynomial kernel:
$K(x_i, x_{i'}) = (1+\sum^p_{j=1}x_{ij}x_{i'j})^d$,
where $d$ is a positive integer.
[MORE INFO ON polynomial kernels]
poly_info()
Another popular choice is the radial kernel:
$K(x_i, x_{i'}) = \exp(-\gamma\sum^p_{j=1}(x_{ij}-x_{i'j})^2)$,
where $\gamma$ is a positive constant.
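As a quick sanity check, the three kernels above can be written as plain NumPy expressions and compared against scikit-learn's pairwise helpers (the example vectors and hyperparameter values are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Two hypothetical observations (row vectors, as sklearn expects 2-D arrays)
xi = np.array([[1.0, 2.0]])
xj = np.array([[3.0, 0.5]])

# Linear kernel: the inner product sum_j x_ij * x_i'j
k_lin = float(xi @ xj.T)

# Polynomial kernel of degree d: (1 + <x_i, x_i'>)**d
d = 2
k_poly = (1 + float(xi @ xj.T)) ** d

# Radial (RBF) kernel: exp(-gamma * ||x_i - x_i'||^2)
gamma = 0.5
k_rbf = float(np.exp(-gamma * np.sum((xi - xj) ** 2)))

# sklearn's pairwise helpers compute the same quantities
assert np.isclose(k_lin, linear_kernel(xi, xj)[0, 0])
assert np.isclose(k_poly, polynomial_kernel(xi, xj, degree=d, gamma=1, coef0=1)[0, 0])
assert np.isclose(k_rbf, rbf_kernel(xi, xj, gamma=gamma)[0, 0])
print(k_lin, k_poly, k_rbf)
```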
with warnings.catch_warnings():
warnings.simplefilter("ignore")
non_linear_examples()
When an SVM trained with an RBF kernel classifies a test observation $x^* = (x^*_1,...,x^*_p)^T$, only training observations close to $x^*$ (in terms of Euclidean distance) play a role in its class label. This is because for distant observations $\sum^p_{j=1}(x^*_j-x_{ij})^2$ will be large, so $\exp(-\gamma\sum^p_{j=1}(x^*_j-x_{ij})^2)$ will be small.
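A small numeric illustration of this locality (the points and $\gamma$ value are made up):

```python
import numpy as np

gamma = 1.0  # made-up value
x_star = np.array([0.0, 0.0])
x_near = np.array([0.5, 0.5])   # hypothetical nearby training point
x_far = np.array([5.0, 5.0])    # hypothetical distant training point

def rbf(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# The nearby point contributes meaningfully; the distant one is
# effectively invisible to the prediction for x_star.
print(rbf(x_star, x_near, gamma))  # exp(-0.5), about 0.61
print(rbf(x_star, x_far, gamma))   # exp(-50), essentially zero
```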
[MORE INFO ON RBF kernels]
rbk_info()
[MORE INFO ON HOW C AND GAMMA INTERACT]
The Iris dataset is useful for demonstrating SVMs, but is a bit "too small to be representative of real world machine learning tasks" (web2). Also, with SVMs we'll probably do well no matter how our hyperparameters are set. So let's demonstrate their use on a more realistic (but still manageable) dataset.
| Name | Data Type | Meas. | Description |
|---|---|---|---|
| Sex | nominal | | M, F, and I (infant) |
| Length | continuous | mm | longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | | +1.5 gives the age in years |
names = ["Sex", "Length", "Diameter", "Height", "Whole weight",
"Shucked weight", "Viscera weight", "Shell weight", "Rings"]
df = pd.read_csv("abalone_data.csv", names=names)
df.head()
| Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
sns.pairplot(df, hue="Sex", height=3, aspect = 1.5)
plt.show()
There are three classes for SVM classification in Scikit-Learn (Géron):
| Class | Time Complexity | Out-of-core Support | Kernel Trick |
|---|---|---|---|
| LinearSVC | O(m × n) | No | No |
| SGDClassifier | O(m × n) | Yes | No |
| SVC | O(m² × n) to O(m³ × n) | No | Yes |
First let's make a pipeline with two steps:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(X.values, y_bin.values, test_size = 0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state=42)
linear_svm = Pipeline([
("scaler", StandardScaler()),
("svm_clf", LinearSVC(random_state=42))
])
rbf_svm = Pipeline([
("scaler", StandardScaler()),
("svm_clf", SVC(random_state=42))])
linear_svm.fit(X_train, y_train)
rbf_svm.fit(X_train, y_train)
print("linear_svm")
display(pd.DataFrame(classification_report(y_val, linear_svm.predict(X_val), output_dict=True)))
print("rbf_svm")
display(pd.DataFrame(classification_report(y_val, rbf_svm.predict(X_val), output_dict=True)))
linear_svm
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.892938 | 0.665217 | 0.814649 | 0.779078 | 0.824860 |
| recall | 0.835821 | 0.765000 | 0.814649 | 0.800410 | 0.814649 |
| f1-score | 0.863436 | 0.711628 | 0.814649 | 0.787532 | 0.818053 |
| support | 469.000000 | 200.000000 | 0.814649 | 669.000000 | 669.000000 |
rbf_svm
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.884120 | 0.719212 | 0.834081 | 0.801666 | 0.834820 |
| recall | 0.878465 | 0.730000 | 0.834081 | 0.804232 | 0.834081 |
| f1-score | 0.881283 | 0.724566 | 0.834081 | 0.802925 | 0.834432 |
| support | 469.000000 | 200.000000 | 0.834081 | 669.000000 | 669.000000 |
So that was quite simple, but we are only using the default hyperparameters, which is rarely ideal. Let's look at finding some better hyperparameters.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats.distributions import uniform
from time import time
# specify parameters and distributions to sample from
param_dist = {'svm_clf__class_weight': [None, "balanced"],
'svm_clf__C':list(range(1,50))}
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(linear_svm, param_distributions=param_dist,
n_iter=n_iter_search, random_state=42)
start = time()
with warnings.catch_warnings():
warnings.simplefilter("ignore")
random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates" % ((time() - start), n_iter_search))
pd.DataFrame(random_search.cv_results_).sort_values("rank_test_score")[["param_svm_clf__class_weight", "param_svm_clf__C", "mean_test_score", "std_test_score"]].head()
RandomizedSearchCV took 6.26 seconds for 20 candidates
| param_svm_clf__class_weight | param_svm_clf__C | mean_test_score | std_test_score | |
|---|---|---|---|---|
| 14 | None | 23 | 0.823733 | 0.011409 |
| 17 | None | 7 | 0.822235 | 0.013053 |
| 9 | None | 1 | 0.821860 | 0.012871 |
| 15 | None | 3 | 0.821860 | 0.012871 |
| 1 | None | 21 | 0.821857 | 0.011044 |
# specify parameters and distributions to sample from
param_dist = {'svm_clf__class_weight': [None, "balanced"],
'svm_clf__C':list(range(1,50)),
'svm_clf__gamma':uniform(0,1)}
# run randomized search
n_iter_search = 20
random_search_rbf = RandomizedSearchCV(rbf_svm, param_distributions=param_dist,
n_iter=n_iter_search, random_state=42)
start = time()
random_search_rbf.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates" % ((time() - start), n_iter_search))
pd.DataFrame(random_search_rbf.cv_results_).sort_values("rank_test_score")[["param_svm_clf__class_weight", "param_svm_clf__C", "mean_test_score", "std_test_score"]].head()
RandomizedSearchCV took 12.58 seconds for 20 candidates
| param_svm_clf__class_weight | param_svm_clf__C | mean_test_score | std_test_score | |
|---|---|---|---|---|
| 14 | None | 21 | 0.832334 | 0.011647 |
| 13 | None | 3 | 0.828966 | 0.010244 |
| 7 | None | 21 | 0.828593 | 0.005631 |
| 15 | None | 4 | 0.828219 | 0.007217 |
| 11 | None | 16 | 0.827469 | 0.005509 |
Just from the search above we have improved our RBF scores, although the linear SVM not by much.
print("linear_svm")
display(pd.DataFrame(classification_report(y_val, random_search.best_estimator_.predict(X_val), output_dict=True)))
print("rbf_svm")
display(pd.DataFrame(classification_report(y_val, random_search_rbf.best_estimator_.predict(X_val), output_dict=True)))
linear_svm
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.897912 | 0.655462 | 0.811659 | 0.776687 | 0.825431 |
| recall | 0.825160 | 0.780000 | 0.811659 | 0.802580 | 0.811659 |
| f1-score | 0.860000 | 0.712329 | 0.811659 | 0.786164 | 0.815853 |
| support | 469.000000 | 200.000000 | 0.811659 | 669.000000 | 669.000000 |
rbf_svm
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.881607 | 0.734694 | 0.838565 | 0.808150 | 0.837687 |
| recall | 0.889126 | 0.720000 | 0.838565 | 0.804563 | 0.838565 |
| f1-score | 0.885350 | 0.727273 | 0.838565 | 0.806312 | 0.838092 |
| support | 469.000000 | 200.000000 | 0.838565 | 669.000000 | 669.000000 |
In order to reduce a model's complexity, run time, and potential for overfitting to the training data, dimension reduction techniques can be used. Broadly they can be grouped into methods that create a subset of the original set of features (feature selection) and methods that create new synthetic features by combining the original features and discarding less important ones (feature extraction). Essentially we want to remove "uninformative information" and retain the useful bits (Zheng). If you have too many features, some of them may be highly correlated and therefore redundant. We can either select just some of them, or compress them onto a lower-dimensional subspace (Raschka).
train_df = pd.DataFrame(X_train, columns=names[1:])
correlations = train_df.corr()
# plot correlation matrix
sns.heatmap(correlations,
vmin=-1,
vmax=1,
xticklabels=correlations.columns.values,
yticklabels=correlations.columns.values,
cmap="RdBu_r"
)
plt.xticks(rotation=45,ha='right')
plt.tight_layout()
plt.show()
A computationally efficient method of selecting features is to use a filter method. Filter methods aim to remove features with a low potential to predict outputs, usually through univariate analysis before classification. A filter could be a threshold set on each feature's variance, or on the correlation or mutual information between each feature and the response variable. Although filters are computationally efficient compared to other feature selection methods, they are independent of the model chosen, so should be used conservatively to ensure data a model may find useful is not removed1.
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold.
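A hedged sketch of `VarianceThreshold` on a tiny hypothetical feature matrix (the values and threshold are invented to make one column fail the check):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: the middle column is nearly constant
X = np.array([[0.0, 1.00, 2.0],
              [1.0, 1.01, 0.0],
              [2.0, 1.00, 1.0],
              [3.0, 1.01, 3.0]])

# Drop any feature whose variance falls below the (arbitrary) threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 2): the near-constant column is gone
```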
Mutual information estimation measures the dependency between two variables by estimating the entropy from k-nearest neighbors distances. If two random variables are independent this equals 0, with higher dependency producing larger numbers1.
For feature selection, a threshold can be set on either the number of features you want to keep or a minimum score value.
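For example, `SelectKBest` keeps the `k` features with the highest scores under a chosen criterion (here mutual information; `k=2` is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the k features with the highest mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (150, 2)
```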
from sklearn.feature_selection import mutual_info_classif
plt.figure(figsize=(15,10))
X_train_ = StandardScaler().fit_transform(X_train)
mi = mutual_info_classif(X_train_, y_train)
mi_series = pd.Series(mi, index=feat_labels).sort_values(ascending = False)
mi_series.plot.bar(legend = False, figsize=(15,10))
plt.title('Mutual information')
plt.xticks(rotation=45,ha='right')
plt.tight_layout()
plt.show()
Instead of being independent, feature selection methods can be embedded in the model training process. An example is the l1 regularizer for linear models, which imposes a sparsity constraint so that the model favours fewer features. These methods are efficient and specific to the chosen model, but are not as powerful as wrapper methods (discussed next)1.
Below is just an example of how you could implement it in a pipeline using a Support Vector Machine.
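A hedged sketch of such a pipeline (my own construction, not necessarily the intended code): an l1-penalised `LinearSVC` inside `SelectFromModel` performs the embedded selection, and the surviving features feed a downstream SVM:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

X, y = load_iris(return_X_y=True)

# The l1 penalty drives some LinearSVC coefficients to exactly zero;
# SelectFromModel keeps only the features with nonzero weights, and the
# final SVC trains on that reduced feature set.
embedded = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1))),
    ("svm_clf", SVC(kernel="rbf", random_state=42)),
])
embedded.fit(X, y)
print(embedded.score(X, y))
```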
Wrapper methods are also specific to the chosen model as they directly optimise the accuracy of a classifier by trying subsets of features. This enables keeping features that are useful in combination with others, even if uninformative in isolation1. Wrapper methods are the most computationally expensive, especially when used with nonlinear classifiers.
Wrapper feature selectors can be exhaustive, sampling all possible feature combinations and finding the combination that gives the greatest classifier performance. However, the chosen combination might overfit to the training data and may not make intuitive sense.
Let's first create a smaller dataset to demonstrate this, using only a random subset of the features.
%%time
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
pipe_svc = Pipeline([('scl', StandardScaler()),
('svm_clf', SVC(kernel='rbf',
random_state=42))])
pipe_svc.set_params(**random_search_rbf.best_params_)
efs = ExhaustiveFeatureSelector(pipe_svc,
min_features=1,
max_features=X_train.shape[1],
print_progress=False,
scoring='accuracy',
cv=StratifiedKFold(),
n_jobs=-1,
clone_estimator=True)
efs.fit(X_train, y_train, custom_feature_names=names[1:])
df = pd.DataFrame.from_dict(efs.get_metric_dict()).T
df.sort_values('avg_score', inplace=True, ascending=False)
print(color.BOLD+color.UNDERLINE+"Best Found Feature Combination"+color.END)
display(df['feature_names'].iloc[0])
display(df.head())
Best Found Feature Combination
('Length', 'Whole weight', 'Shucked weight', 'Shell weight', 'Rings')
| feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err | |
|---|---|---|---|---|---|---|---|
| 194 | (0, 3, 4, 6, 7) | [0.8392523364485981, 0.8205607476635514, 0.827... | 0.835333 | (Length, Whole weight, Shucked weight, Shell w... | 0.014514 | 0.011292 | 0.005646 |
| 230 | (0, 1, 3, 4, 6, 7) | [0.8392523364485981, 0.8205607476635514, 0.827... | 0.834959 | (Length, Diameter, Whole weight, Shucked weigh... | 0.013732 | 0.010684 | 0.005342 |
| 174 | (0, 1, 3, 4, 7) | [0.8392523364485981, 0.8186915887850468, 0.827... | 0.834585 | (Length, Diameter, Whole weight, Shucked weigh... | 0.015180 | 0.011811 | 0.005905 |
| 119 | (0, 3, 4, 7) | [0.8429906542056075, 0.8130841121495327, 0.823... | 0.833837 | (Length, Whole weight, Shucked weight, Rings) | 0.018335 | 0.014266 | 0.007133 |
| 235 | (0, 2, 3, 4, 6, 7) | [0.8392523364485981, 0.8186915887850468, 0.829... | 0.833087 | (Length, Height, Whole weight, Shucked weight,... | 0.012036 | 0.009365 | 0.004682 |
Wall time: 27.8 s
metric_dict = efs.get_metric_dict()
fig = plt.figure(figsize=(15,10))
k_feat = sorted(metric_dict.keys())
avg = [metric_dict[k]['avg_score'] for k in k_feat]
upper, lower = [], []
for k in k_feat:
upper.append(metric_dict[k]['avg_score'] +
metric_dict[k]['std_dev'])
lower.append(metric_dict[k]['avg_score'] -
metric_dict[k]['std_dev'])
plt.fill_between(k_feat,
upper,
lower,
alpha=0.2,
color='blue',
lw=1)
plt.plot(k_feat, avg, color='blue', marker='o')
plt.ylabel('Accuracy +/- Standard Deviation')
plt.xlabel('Number of Features')
feature_min = len(metric_dict[k_feat[0]]['feature_idx'])
feature_max = len(metric_dict[k_feat[-1]]['feature_idx'])
plt.xticks(k_feat,
[str(metric_dict[k]['feature_names']) for k in k_feat],
rotation=45, ha='right')
plt.show()
Alternatively, you can select features sequentially by removing or adding features until a subset of the desired size is reached. This is a suboptimal solution but is computationally less expensive1.
As can be seen, we can find a model with 4 features from our full data in a short time relative to the exhaustive search above.
%%time
from mlxtend.feature_selection import SequentialFeatureSelector
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
sfs = SequentialFeatureSelector(pipe_svc,
k_features=4,
forward=True,
floating=False,
#verbose=2,
scoring='accuracy',
cv=StratifiedKFold(),
n_jobs=-1)
sfs.fit(X_train_reduced, y_train, custom_feature_names=feat_labels[indices])
display(pd.DataFrame.from_dict(sfs.get_metric_dict()).T)
| feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err | |
|---|---|---|---|---|---|---|---|
| 1 | (1,) | [0.7925233644859813, 0.7925233644859813, 0.784... | 0.793788 | (Whole weight,) | 0.008488 | 0.006604 | 0.003302 |
| 2 | (1, 2) | [0.7813084112149533, 0.788785046728972, 0.7996... | 0.798663 | (Whole weight, Shucked weight) | 0.015705 | 0.012219 | 0.006109 |
| 3 | (0, 1, 2) | [0.7869158878504673, 0.7906542056074767, 0.799... | 0.800158 | (Shell weight, Whole weight, Shucked weight) | 0.013513 | 0.010514 | 0.005257 |
| 4 | (0, 1, 2, 4) | [0.788785046728972, 0.7831775700934579, 0.8071... | 0.797913 | (Shell weight, Whole weight, Shucked weight, H... | 0.012848 | 0.009996 | 0.004998 |
Wall time: 1.8 s
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
A method growing in popularity is to use model stacking, where the input to one model is the output of another. This allows for nonlinearities to be captured in the first model, and the potential to use a simple linear model as the last layer. Deep learning is an example of model stacking: neural network layers are stacked on top of one another to optimise both the features and the classifier simultaneously1.
An example of model stacking is to use the output of a decision tree-type model as input to a linear classifier. We can obtain the importance of each feature as the average impurity decrease computed over all decision trees in the forest, without regard to the linear separability of the classes. However, if features are highly correlated, one feature may be ranked highly while the information in the others is not fully captured1.
from collections import Counter
print(Counter(y_train))
Counter({0: 1797, 1: 875})
TODO
Some models (e.g. tree-based classifiers) are inherently multiclass, whereas other machine learning algorithms can be extended to multi-class classification using techniques such as the One-versus-Rest or One-versus-One methods (Géron).
The One-versus-Rest approach is where you train a classifier for each class and select the class whose classifier outputs the highest score (Géron). In other terms, if we fit $K$ SVMs, we assign a test observation $x^*$ to the class for which $\beta_{0k} + \beta_{1k}x^*_1 + ... + \beta_{pk}x^*_p$ is largest (the most confident) (James).
As each class is fitted against all other classes, the approach is relatively interpretable (web1).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score
rbf = SVC(C=100, kernel='rbf', random_state=42)
multi_rbf = Pipeline([
('scl', StandardScaler()),
('clf', OneVsRestClassifier(rbf))])
display(cross_val_score(multi_rbf, iris_df, target, cv=3))
array([0.96, 0.92, 0.94])
from mlxtend.plotting import plot_decision_regions
multi_rbf.fit(iris_df[["petal length (cm)", "petal width (cm)"]], target)
plot_decision_regions(iris_df[["petal length (cm)", "petal width (cm)"]].values,
target.values,
clf = multi_rbf)
plt.xlabel("X LABEL")
plt.ylabel("Y LABEL")
#plt.xlim(0,.6)
#plt.ylim(0,1.)
plt.show()
Another strategy is the One-versus-One approach. This trains $N \times (N-1) / 2$ classifiers by comparing each pair of classes; when a prediction is made, the class that is selected most often is chosen (Géron) (we'll get more onto bagging next week). It is useful where algorithms do not scale well with data size (such as SVMs), because each classifier only needs to be trained on, and make predictions from, a small subset of the data (Géron, web1).
from sklearn.multiclass import OneVsOneClassifier
multi_rbf = Pipeline([
('scl', StandardScaler()),
('clf', OneVsOneClassifier(rbf))])
display(cross_val_score(multi_rbf, iris_df, target, cv=3))
array([0.94, 0.92, 0.96])
from mlxtend.plotting import plot_decision_regions
multi_rbf.fit(iris_df[["petal length (cm)", "petal width (cm)"]], target)
plot_decision_regions(iris_df[["petal length (cm)", "petal width (cm)"]].values,
target.values,
clf = multi_rbf)
plt.xlabel("X LABEL")
plt.ylabel("Y LABEL")
#plt.xlim(0,.6)
#plt.ylim(0,1.)
plt.show()
A group of classifiers don't all have to be SVMs. Indeed, Scikit-learn has a VotingClassifier where multiple classification pipelines can be combined to create an even better classifier that aggregates their predictions. This aggregation can be done by simply selecting the class label predicted by the majority of the classifiers (more than 50% of votes) for 'hard voting'. Majority vote refers to binary class decisions but can be generalised to a multi-class setting using 'plurality voting'. Some classifiers return the probability of a predicted class label via the predict_proba method, and these probabilities can be used for 'soft voting' instead of class labels1.
Ensemble methods work best when the predictors are as independent as possible, so one way of achieving this is to use diverse classifiers. This increases the chance that they each make different types of errors, which in combination improves the overall accuracy2.
As can be seen below, the soft majority voter has better scores than the hard voting method, and better than most of the other methods individually, when all are on their default settings. Soft voting often achieves higher performance than hard voting because highly confident votes are given more weight2.
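A hedged sketch of such an ensemble (the choice of base classifiers is an assumption for illustration). Note that `probability=True` is needed on the SVC so that it exposes `predict_proba` for soft voting:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Diverse base classifiers; probability=True makes the SVC expose
# predict_proba, which soft voting needs
clfs = [("lr", LogisticRegression(random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42))]

for voting in ("hard", "soft"):
    ensemble = Pipeline([("scl", StandardScaler()),
                         ("clf", VotingClassifier(clfs, voting=voting))])
    scores = cross_val_score(ensemble, X, y, cv=3)
    print(voting, scores.mean())
```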
NOTE
web1. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
web2. https://scikit-learn.org/stable/datasets/toy_dataset.html